AITopics

2512.07019

Country:

North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > Michigan (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > Ukraine > Kyiv Oblast > Kyiv (0.04)

Genre: Research Report (1.00)

Industry: Education (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Sühr, Tom, Dorner, Florian E., Salaudeen, Olawale, Kelava, Augustin, Samadi, Samira

Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead

arXiv.org Artificial IntelligenceAug-1-2025

Large Language Models (LLMs) have achieved remarkable results on a range of standardized tests originally designed to assess human cognitive and psychological traits, such as intelligence and personality. While these results are often interpreted as strong evidence of human-like characteristics in LLMs, this paper argues that such interpretations constitute an ontological error. Human psychological and educational tests are theory-driven measurement instruments, calibrated to a specific human population. Applying these tests to non-human subjects without empirical validation, risks mischaracterizing what is being measured. Furthermore, a growing trend frames AI performance on benchmarks as measurements of traits such as ``intelligence'', despite known issues with validity, data contamination, cultural bias and sensitivity to superficial prompt changes. We argue that interpreting benchmark performance as measurements of human-like traits, lacks sufficient theoretical and empirical justification. This leads to our position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead. We call for the development of principled, AI-specific evaluation frameworks tailored to AI systems. Such frameworks might build on existing frameworks for constructing and validating psychometrics tests, or could be created entirely from scratch to fit the unique context of AI.

large language model, machine learning, natural language, (16 more...)

2507.23009

Country:

North America > United States (1.00)
Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)

Genre: Research Report > New Finding (0.46)

Industry:

Law (1.00)
Education > Assessment & Standards (1.00)
Health & Medicine > Therapeutic Area (0.93)
Government > Regional Government > North America Government > United States Government (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

arXiv.org Artificial IntelligenceJul-15-2025

Generative Cognitive Diagnosis

Li, Jiatong, Liu, Qi, Zhu, Mengxiao

Cognitive diagnosis (CD) models latent cognitive states of human learners by analyzing their response patterns on diagnostic tests, serving as a crucial machine learning technique for educational assessment and evaluation. Traditional cognitive diagnosis models typically follow a transductive prediction paradigm that optimizes parameters to fit response scores and extract learner abilities. These approaches face significant limitations as they cannot perform instant diagnosis for new learners without computationally expensive retraining and produce diagnostic outputs with limited reliability. In this study, we introduces a novel generative diagnosis paradigm that fundamentally shifts CD from predictive to generative modeling, enabling inductive inference of cognitive states without parameter re-optimization. We propose two simple yet effective instantiations of this paradigm: Generative Item Response Theory (G-IRT) and Generative Neural Cognitive Diagnosis Model (G-NCDM), which achieve excellent performance improvements over traditional methods. The generative approach disentangles cognitive state inference from response prediction through a well-designed generation process that incorporates identifiability and monotonicity conditions. Extensive experiments on real-world datasets demonstrate the effectiveness of our methodology in addressing scalability and reliability challenges, especially $\times 100$ speedup for the diagnosis of new learners. Our framework opens new avenues for cognitive diagnosis applications in artificial intelligence, particularly for intelligent model evaluation and intelligent education systems. The code is available at https://github.com/CSLiJT/Generative-CD.git.

artificial intelligence, learner, machine learning, (18 more...)

2507.09831

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.48)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Wallmark, Joakim, Josefsson, Maria, Wiberg, Marie

Introducing Flexible Monotone Multiple Choice Item Response Theory Models and Bit Scales

arXiv.org Machine LearningOct-2-2024

Item Response Theory (IRT) is a powerful statistical approach for evaluating test items and determining test taker abilities through response analysis. An IRT model that better fits the data leads to more accurate latent trait estimates. In this study, we present a new model for multiple choice data, the monotone multiple choice (MMC) model, which we fit using autoencoders. Using both simulated scenarios and real data from the Swedish Scholastic Aptitude Test, we demonstrate empirically that the MMC model outperforms the traditional nominal response IRT model in terms of fit. Furthermore, we illustrate how the latent trait scale from any fitted IRT model can be transformed into a ratio scale, aiding in score interpretation and making it easier to compare different types of IRT models. We refer to these new scales as bit scales. Bit scales are especially useful for models for which minimal or no assumptions are made for the latent trait scale distributions, such as for the autoencoder fitted models in this study.

irt model, mmc model, test taker, (14 more...)

2410.0148

Country:

North America > United States > Tennessee > Knox County > Knoxville (0.04)
North America > United States > Massachusetts > Middlesex County > Reading (0.04)
North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry: Education > Assessment & Standards > Assessment Methods (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Peereboom, Sanne, Schwabe, Inga, Kleinberg, Bennett

Cognitive phantoms in LLMs through the lens of latent variables

arXiv.org Artificial IntelligenceSep-6-2024

Large language models (LLMs) increasingly reach real-world applications, necessitating a better understanding of their behaviour. Their size and complexity complicate traditional assessment methods, causing the emergence of alternative approaches inspired by the field of psychology. Recent studies administering psychometric questionnaires to LLMs report human-like traits in LLMs, potentially influencing LLM behaviour. However, this approach suffers from a validity problem: it presupposes that these traits exist in LLMs and that they are measurable with tools designed for humans. Typical procedures rarely acknowledge the validity problem in LLMs, comparing and interpreting average LLM scores. This study investigates this problem by comparing latent structures of personality between humans and three LLMs using two validated personality questionnaires. Findings suggest that questionnaires designed for humans do not validly measure similar constructs in LLMs, and that these constructs may not exist in LLMs at all, highlighting the need for psychometric analyses of LLM responses to avoid chasing cognitive phantoms. Keywords: large language models, psychometrics, machine behaviour, latent variable modeling, validity

dimension, llm, questionnaire, (17 more...)

2409.15324

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
Europe > Netherlands (0.04)

Genre:

Questionnaire & Opinion Survey (1.00)
Research Report > New Finding (0.88)
Research Report > Experimental Study (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)

Romero, Peter, Fitz, Stephen, Nakatsuma, Teruo

Do GPT Language Models Suffer From Split Personality Disorder? The Advent Of Substrate-Free Psychometrics

arXiv.org Artificial IntelligenceAug-15-2024

Previous research on emergence in large language models shows these display apparent human-like abilities and psychological latent traits. However, results are partly contradicting in expression and magnitude of these latent traits, yet agree on the worrisome tendencies to score high on the Dark Triad of narcissism, psychopathy, and Machiavellianism, which, together with a track record of derailments, demands more rigorous research on safety of these models. We provided a state of the art language model with the same personality questionnaire in nine languages, and performed Bayesian analysis of Gaussian Mixture Model, finding evidence for a deeper-rooted issue. Our results suggest both interlingual and intralingual instabilities, which indicate that current language models do not develop a consistent core personality. This can lead to unsafe behaviour of artificial intelligence systems that are based on these foundation models, and are increasingly integrated in human life. We subsequently discuss the shortcomings of modern psychometrics, abstract it, and provide a framework for its species-neutral, substrate-free formulation.

gpt language model suffer, gpt-3, split personality disorder, (11 more...)

doi: 10.21203/rs.3.rs-2717108/v1

2408.07377

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Personality Disorder > Antisocial Personality Disorder (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

Duck-Mayr, JBrandon, Garnett, Roman, Montgomery, Jacob M.

GPIRT: A Gaussian Process Model for Item Response Theory

arXiv.org Machine LearningJun-17-2020

The goal of item response theoretic (IRT) models is to provide estimates of latent traits from binary observed indicators and at the same time to learn the item response functions (IRFs) that map from latent trait to observed response. However, in many cases observed behavior can deviate significantly from the parametric assumptions of traditional IRT models. Nonparametric IRT models overcome these challenges by relaxing assumptions about the form of the IRFs, but standard tools are unable to simultaneously estimate flexible IRFs and recover ability estimates for respondents. We propose a Bayesian nonparametric model that solves this problem by placing Gaussian process priors on the latent functions defining the IRFs. This allows us to simultaneously relax assumptions about the shape of the IRFs while preserving the ability to estimate latent traits. This in turn allows us to easily extend the model to further tasks such as active learning. GPIRT therefore provides a simple and intuitive solution to several longstanding problems in the IRT literature.

artificial intelligence, bayesian inference, machine learning, (18 more...)

2006.099

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
North America > United States > New York (0.04)
North America > United States > Massachusetts > Middlesex County > Reading (0.04)
Europe > Middle East > Cyprus (0.04)

Genre:

Research Report > Experimental Study (0.46)
Research Report > New Finding (0.46)

Industry: Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

arXiv.org Machine LearningApr-17-2020

Deep Reinforcement Learning for Adaptive Learning Systems

Li, Xiao, Xu, Hanchen, Zhang, Jinming, Chang, Hua-hua

In this paper, we formulate the adaptive learning problem---the problem of how to find an individualized learning plan (called policy) that chooses the most appropriate learning materials based on learner's latent traits---faced in adaptive learning systems as a Markov decision process (MDP). We assume latent traits to be continuous with an unknown transition model. We apply a model-free deep reinforcement learning algorithm---the deep Q-learning algorithm---that can effectively find the optimal learning policy from data on learners' learning process without knowing the actual transition model of the learners' continuous latent traits. To efficiently utilize available data, we also develop a transition model estimator that emulates the learner's learning process using neural networks. The transition model estimator can be used in the deep Q-learning algorithm so that it can more efficiently discover the optimal learning policy for a learner. Numerical simulation studies verify that the proposed algorithm is very efficient in finding a good learning policy, especially with the aid of a transition model estimator, it can find the optimal learning policy after training using a small number of learners.

learner, machine learning, reinforcement learning, (14 more...)

2004.0841

Country:

North America > United States > Illinois (0.04)
North America > United States > North Carolina > Wake County > Raleigh (0.04)
North America > United States > New Jersey > Mercer County > Princeton (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Education > Educational Technology > Educational Software > Computer Based Training (0.93)
Education > Educational Setting > Online (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

Benedetto, Luca, Cappelli, Andrea, Turrin, Roberto, Cremonesi, Paolo

R2DE: a NLP approach to estimating IRT parameters of newly generated questions

arXiv.org Machine LearningJan-21-2020

The main objective of exams consists in performing an assessment of students' expertise on a specific subject. Such expertise, also referred to as skill or knowledge level, can then be leveraged in different ways (e.g., to assign a grade to the students, to understand whether a student might need some support, etc.). Similarly, the questions appearing in the exams have to be assessed in some way before being used to evaluate students. Standard approaches to questions' assessment are either subjective (e.g., assessment by human experts) or introduce a long delay in the process of question generation (e.g., pretesting with real students). In this work we introduce R2DE (which is a Regressor for Difficulty and Discrimination Estimation), a model capable of assessing newly generated multiple-choice questions by looking at the text of the question and the text of the possible choices. In particular, it can estimate the difficulty and the discrimination of each question, as they are defined in Item Response Theory. We also present the results of extensive experiments we carried out on a real world large scale dataset coming from an e-learning platform, showing that our model can be used to perform an initial assessment of newly created questions and ease some of the problems that arise in question generation.

estimation, latent trait, student, (15 more...)

2001.07569

Country:

Europe > Germany > Hesse > Darmstadt Region > Frankfurt (0.05)
Europe > Switzerland (0.04)
Europe > Spain > Galicia > Madrid (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry: Education > Educational Technology > Educational Software (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)

arXiv.org Artificial IntelligenceJun-1-2019

Enhancing Item Response Theory for Cognitive Diagnosis

Cheng, Song, Liu, Qi

Cognitive diagnosis is a fundamental and crucial task in many educational applications, e.g., computer adaptive test and cognitive assignments. Item Response Theory (IRT) is a classical cognitive diagnosis method which can provide interpretable parameters (i.e., student latent trait, question discrimination, and difficulty) for analyzing student performance. However, traditional IRT ignores the rich information in question texts, cannot diagnose knowledge concept proficiency, and it is inaccurate to diagnose the parameters for the questions which only appear several times. To this end, in this paper, we propose a general Deep Item Response Theory (DIRT) framework to enhance traditional IRT for cognitive diagnosis by exploiting semantic representation from question texts with deep learning. In DIRT, we first use a proficiency vector to represent students' proficiency in knowledge concepts and embed question texts and knowledge concepts to dense vectors by Word2Vec. Then, we design a deep diagnosis module to diagnose parameters in traditional IRT by deep learning techniques. Finally, with the diagnosed parameters, we input them into the logistic-like formula of IRT to predict student performance. Extensive experimental results on real-world data clearly demonstrate the effectiveness and interpretation power of DIRT framework.

artificial intelligence, machine learning, natural language, (17 more...)

1905.10957

Genre: Research Report (0.50)

Industry:

Education > Assessment & Standards > Student Performance (0.70)
Education > Educational Setting > Online (0.68)
Education > Educational Technology > Educational Software > Computer Based Training (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)